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Abstract 

We study Nystrom type subsampling approaches to large scale kernel methods, 
and prove learning bounds in the statistical learning setting, where random sampling 
and high probability estimates are considered. In particular, we prove that these ap¬ 
proaches can achieve optimal learning bounds, provided the subsampling level is suit¬ 
ably chosen. These results suggest a simple incremental variant of Nystrom Kernel 
Regularized Least Squares, where the subsampling level implements a form of com¬ 
putational regularization, in the sense that it controls at the same time regularization 
and computations. Extensive experimental analysis shows that the considered ap¬ 
proach achieves state of the art performances on benchmark large scale datasets. 


1 Introduction 

Kernel methods provide an elegant and effective framework to develop nonparametric sta¬ 
tistical approaches to learning [1]. However, memory requirements make these methods 
unfeasible when dealing with large datasets. Indeed, this observation has motivated a 
variety of computational strategies to develop large scale kernel methods [2-8]. 

In this paper we study subsampling methods, that we broadly refer to as Nystrom ap¬ 
proaches. These methods replace the empirical kernel matrix, needed by standard kernel 
methods, with a smaller matrix obtained by (column) subsampling [2,3]. Such procedures 
are shown to often dramatically reduce memory/time requirements while preserving good 
practical performances [9-12]. The goal of our study is two-fold. First, and foremost, we 
aim at providing a theoretical characterization of the generalization properties of such 
learning schemes in a statistical learning setting. Second, we wish to understand the role 
played by the subsampling level both from a statistical and a computational point of view. 
As discussed in the following, this latter question leads to a natural variant of Kernel Reg¬ 
ularized Least Squares (KRLS), where the subsampling level controls both regularization 
and computations. 
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From a theoretical perspective, the effect of Nystrom approaches has been primarily 
characterized considering the discrepancy between a given empirical kernel matrix and its 
subsampled version [13-19]. While interesting in their own right, these latter results do 
not directly yield information on the generalization properties of the obtained algorithm. 
Results in this direction, albeit suboptimal, were first derived in [20] (see also [21,22]), 
and more recently in [23,24]. In these latter papers, sharp error analyses in expectation 
are derived in a fixed design regression setting for a form of Kernel Regularized Least 
Squares. In particular, in [23] a basic uniform sampling approach is studied, while in [24] 
a subsampling scheme based on the notion of leverage score is considered. The main tech¬ 
nical contribution of our study is an extension of these latter results to the statistical learn¬ 
ing setting, where the design is random and high probability estimates are considered. 
The more general setting makes the analysis considerably more complex. Our main result 
gives optimal finite sample bounds for both uniform and leverage score based subsam¬ 
pling strategies. These methods are shown to achieve the same (optimal) learning error 
as kernel regularized least squares, recovered as a special case, while allowing substantial 
computational gains. Our analysis highlights the interplay between the regularization and 
subsampling parameters, suggesting that the latter can be used to control simultaneously 
regularization and computations. This strategy implements a form of computational regu¬ 
larization in the sense that the computational resources are tailored to the generalization 
properties in the data. This idea is developed considering an incremental strategy to ef¬ 
ficiently compute learning solutions for different subsampling levels. The procedure thus 
obtained, which is a simple variant of classical Nystrom Kernel Regularized Least Squares 
with uniform sampling, allows for efficient model selection and achieves state of the art 
results on a variety of benchmark large scale datasets. 

The rest of the paper is organized as follows. In Section 2, we introduce the setting and 
algorithms we consider. In Section 3, we present our main theoretical contributions. In 
Section 4, we discuss computational aspects and experimental results. 

2 Supervised learning with KRLS and Nystrom approaches 

Let X X M be a probability space with distribution p, where we view X and M as the 
input and output spaces, respectively. Let px denote the marginal distribution of p on X 
and p(-|x) the conditional distribution on M given x G X. Given a hypothesis space H of 
measurable functions from X to E, the goal is to minimize the expected risk, 

(fW- 4 )^dp(x,y), (1) 

-XxR 

provided p is known only through a training set of (xi,yi)|hi sampled identically and 
independently according to p. A basic example of the above setting is random design 
regression with the squared loss, in which case 

Pi = f*(xi)-f Ci, i = l,...,rL, (2) 

with f* a fixed regression function, ci,..., Cn a sequence of random variables seen as 
noise, and xi,..., Xn random inputs. In the following, we consider kernel methods, based 


minf(f), 

ten 
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on choosing a hypothesis space which is a separable reproducing kernel Hilbert space. 
The latter is a Hilbert space T-L of functions, with inner product (•, such that there 
exists a function K : X x X —> M with the following two properties: 1) for all x E X, 
kx(-) = k(x, •) belongs to T-L, and 2) the so called reproducing property holds: f(x) = 
(f> for all f E X € X [25]. The function K, called reproducing kernel, is easily 

shown to be symmetric and positive definite, that is the kernel matrix (KiM)i j = K(xi,Xj) is 
positive semidefinite for all xi,..., xn E X, N e N. A classical way to derive an empirical 
solution to problem (1) is to consider a Tikhonov regularization approach, based on the 
minimization of the penalized empirical functional, 

1 

min -y (f(xi) -yi)^ + A||f|||^,A > 0. (3) 

fGri TL 

1=1 

The above approach is referred to as Kernel Regularized Least Squares (KRLS) or Kernel 
Ridge Regression (KRR). It is easy to see that a solution to problem (3) exists, it is 
unique and the representer theorem [1] shows that it can be written as 

TL 

^a(^) = ^ &ik(xi,x) with & = (Kn + Anl]“’y, (4) 

i=l 

where xi,..., Xn are the training set points, y = (y i,..., rjn) and K^ is the empirical kernel 
matrix. Note that this result implies that we can restrict the minimization in (3) to the 
space, 

n 

"Hn = {f G ^ I f = ^ aiK(xi, •], ai,..., an E M}. 

i=l 

Storing the kernel matrix Kn, and solving the linear system in (4), can become computa¬ 
tionally unfeasible as n increases. In the following, we consider strategies to find more 
efficient solutions, based on the idea of replacing Hn with 

m 

= {f I f = X ^ 

i=l 

where m < n and {xi,..., Xml is a subset of the input points in the training set. The 
solution f;\ ni of the corresponding minimization problem can now be written as, 

m 

= Y- with a = (K^^Knm + ArLKTT^nx)^K^my) (5) 

i=l 

where At denotes the Moore-Penrose pseudoinverse of a matrix A, and (Knra)ij = k(xi, xj), 
(Kram)kj = K(xic,Xj) with 1 E {1,...,n} and j,k E {1,... ,m} [2]. The above approach is 
related to Nystrom methods and different approximation strategies correspond to differ¬ 
ent ways to select the inputs subset. While our framework applies to a broader class of 
strategies, see Section C.l, in the following we primarily consider two techniques. 

Plain Nystrom. The points (xi,..., Xra} are sampled uniformly at random without replace¬ 
ment from the training set. 
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Approximate leverage scores (ALS) Nystrom. Recall that the leverage scores associated 
to the training set points xi,..., Xn are 

(li(t))r=i, lt(t] = (Kn(Kn + trLl)“^)u, iG{l,...,rL} (6) 

for any t > 0, where = K(xi, xj). In practice, leverage scores are onerous to compute 

and approximations can be considered [16,17,24] . In particular, in the following 

we are interested in suitable approximations defined as follows: 

Definition 1 (T-approximate leverage scores). Let (li,(t)]|hi be the leverage scores associ¬ 
ated to the training set for a given t. Let 5 > 0, to > 0 and T > 1. We say that are 

T-approximate leverage scores with confidence 6, when with probability at least 1 — 5, 

:|rh(t) < ti(t) < Tli(t) Vi € {1,..., n}, t > to. 

Given T-approximate leverage scores for t > Aq, {xi,..., x^n,} are sampled from the 
training set independently with replacement, and with probability to be selected given by 

Pt(i)=ti(t)/Ljtj(t). 

In the next section, we state and discuss our main result showing that the KRLS formu¬ 
lation based on plain or approximate leverage scores Nystrom provides optimal empirical 
solutions to problem (1). 

3 Theoretical analysis 

In this section, we state and discuss our main results. We need several assumptions. The 
first basic assumption is that problem (1) admits at least a solution. 

Assumption 1. There exists an G 7^ such that 

= min£:(f). 
few 

Note that, while the minimizer might not be unique, our results apply to the case in 
which is the unique minimizer with minimal norm. Also, note that the above condition 
is weaker than assuming the regression function in (2) to belong to TL. Finally, we note 
that the study of the paper can be adapted to the case in which minimizers do not exist, 
but the analysis is considerably more involved and left to a longer version of the paper. 
The second assumption is a basic condition on the probability distribution. 

Assumption 2. Let be the random variable z^=y — with x G X, and y distributed 

according to p(y|x). Then, there exists M, a > 0 such that for any 

p > 2, almost everywhere on X. 

The above assumption is needed to control random quantities and is related to a noise 
assumption in the regression model (2) . It is clearly weaker than the often considered 
bounded output assumption [25], and trivially verified in classification. 

The last two assumptions describe the capacity (roughly speaking the “size’T of the hy¬ 
pothesis space induced by K with respect to p and the regularity of with respect to K 
and p. To discuss them, we first need the following definition. 
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Definition 2 (Covariance operator and effective dimensions). We define the covariance 
operator as 


C:n 


n, 




f(x]g(x]dpx(x] , 
X 


yfgen. 


Moreover, for A > 0, we define the random variable A/’xlA) = (Kx, (C + AI) ^ ^x)-^ with x G X 
distributed according to px and let 


A((A) = EA(x(A), A(oo(A) = supA4(A). 

xex 

We add several comments. Note that C corresponds to the second moment operator, 
but we refer to it as the covariance operator with an abuse of terminology. Moreover, note 
that A((A) = Tr(C(C + AI)“^) (see [26]). This latter quantity, called effective dimension or 
degrees of freedom, can be seen as a measure of the capacity of the hypothesis space. The 
quantity AfooW can be seen to provide a uniform bound on the leverage scores in Eq. (6). 
Clearly, A/’(A) < A/’oo(A) for all A > 0. 

Assumption 3. The kernel K is measurable, C is bounded. Moreover, for all A > 0 and a 
Q>o, 


A(oo(A) < oo, (7) 

A((A)<QA-^, 0<y<l. (8) 

Measurability of K and boundedness of C are minimal conditions to ensure that the 
covariance operator is a well defined linear, continuous, self-adjoint, positive operator 
[25]. Condition (7) is satisfied if the kernel is bounded sup,^gx < oo, indeed 

in this case A^ootA) < k^/A for all A > 0. Conversely, it can be seen that condition (7) 
together with boundedness of C imply that the kernel is bounded, indeed ^ 

k^<2||C||A(oo(||C||). 

Boundedness of the kernel implies in particular that the operator C is trace class and 
allows to use tools from spectral theory. Condition (8) quantifies the capacity assumption 
and is related to covering/entropy number conditions (see [25] for further details). In 
particular, it is known that condition (8) is ensured if the eigenvalues (oJi of C satisfy a 

polynomial decaying condition Ut ~ i i'. Note that, since the operator C is trace class. 
Condition (8) always holds for y = 1. Here, for space constraints and in the interest of 
clarity we restrict to such a polynomial condition, but the analysis directly applies to other 
conditions including exponential decay or a finite rank conditions [26]. Finally, we have 
the following regularity assumption. 

Assumption 4. There exists s > 0, 1 < R < oo, such that ||C^®f^||^ < R. 

The above condition is fairly standard, and can be equivalently formulated in terms of 
classical concepts in approximation theory such as interpolation spaces [25]. Intuitively, 

hfAfoofA) is finite, then AfoodlCII) = sup,^gxlKC + ||C||I)^’'Kx|d > 1/2||C|C'sup,^gxl|Rx||^, therefore 
K(x,x) <2||C||Woc(||C||). 
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it quantifies the degree to which can be well approximated by functions in the RKHS 
T-L and allows to control the bias/approximation error of a learning solution. For s = 0, it 
is always satisfied. For larger s, we are assuming to belong to subspaces of T-L that are 
the images of the fractional compact operators C®. Such spaces contain functions which, 
expanded on a basis of eigenfunctions of C, have larger coefficients in correspondence 
to large eigenvalues. Such an assumption is natural in view of using techniques such 
as (4), which can be seen as a form of spectral filtering, that estimate stable solutions by 
discarding the contribution of small eigenvalues [27]. In the next section, we are going 
to quantify the quality of empirical solutions of Problem (1) obtained by schemes of the 
form (5), in terms of the quantities in Assumptions 2, 3, 4. 


3.1 Main results 


In this section, we state and discuss our main results, starting with optimal finite sample 
error bounds for regularized least squares based on plain and approximate leverage score 
based Nystrom subsampling. 


Theorem 1. Under Assumptions 1, 2, 3, and 4, let b > 0, v = min(s, 1 /2), p = 1+1 /(2v+y) 
and assume 

6k^ /38p IMk^p^^ 


TL > 1 655k^ + 223k‘^ log —-—h 


|C| 


log- 


l|C||6 


Then, the following inequality holds with probability at least 1 — 8 , 


2v+1 


^(fA,m)-^(f+) < q TT- 2 V+Y+ 1 , with q = 6R |^2||C|| + 

with fy^ra tLS in (5), A = ||C||n 2 v+y+i and 
1. for plain Nystrom 


Mk 




+ 


Qfy 

IlCllt' 


log-, (9) 


m > (67 V5A/'oo(A)) log 


12Kfy 


2. forALS Nystrom and 1-approximate leverage scores with subsampling probabilities P^, 

48rL 


to > ^ log and 


m> (334V78TW(A))log- 


We add several comments. First, the above results can be shown to be optimal in a 
minimax sense. Indeed, minimax lower bounds proved in [26,28] show that the learn¬ 
ing rate in (9) is optimal under the considered assumptions (see Thm. 2, 3 of [26], for 
a discussion on minimax lower bounds see Sec. 2 of [26]). Second, the obtained bounds 
can be compared to those obtained for other regularized learning techniques. Techniques 
known to achieve optimal error rates include Tikhonov regularization [26,28,29], iterative 
regularization by early stopping [30,31], spectral cut-off regularization (a.k.a. principal 
component regression or truncated SVD) [30,31], as well as regularized stochastic gradi¬ 
ent methods [32]. All these techniques are essentially equivalent from a statistical point 
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Figure 1; Validation errors associated to 20 x 20 grids of values for m (x axis) and A (y 
axis) on pumadyn32nh (left), breast cancer (center) and cpuSmall (right). 


of view and differ only in the required computations. For example, iterative methods al¬ 
low for a computation of solutions corresponding to different regularization levels which 
is more efficient than Tikhonov or SVD based approaches. The key observation is that 
all these methods have the same O(n^) memory requirement. In this view, our results 
show that randomized subsampling methods can break such a memory barrier, and con¬ 
sequently achieve much better time complexity, while preserving optimal learning guar¬ 
antees. Finally, we can compare our results with previous analysis of randomized kernel 
methods. As already mentioned, results close to those in Theorem 1 are given in [23,24] 
in a fixed design setting. Our results extend and generalize the conclusions of these papers 
to a general statistical learning setting. Relevant results are given in [8] for a different ap¬ 
proach, based on averaging KRLS solutions obtained splitting the data in m groups (divide 
and conquer RLS). The analysis in [8] is only in expectation, but considers random design 
and shows that the proposed method is indeed optimal provided the number of splits is 
chosen depending on the effective dimension Af(A). This is the only other work we are 
aware of establishing optimal learning rates for randomized kernel approaches in a sta¬ 
tistical learning setting. In comparison with Nystrdm computational regularization the 
main disadvantage of the divide and conquer approach is computational and in the model 
selection phase where solutions corresponding to different regularization parameters and 
number of splits usually need to be computed. 

The proof of Theorem 1 is fairly technical and lengthy. It incorporates ideas from [26] 
and techniques developed to study spectral filtering regularization [30,33]. In the next 
section, we briefly sketch some main ideas and discuss how they suggest an interesting 
perspective on regularization techniques including subsampling. 

3.2 Proof sketch and a computational regularization perspective 

A key step in the proof of Theorem 1 is an error decomposition, and corresponding bound, 
for any fixed A and m. Indeed, it is proved in Theorem 2 and Proposition 2 that, for 5 > 0, 
with probability at least 1 — 6, 

< R log^+RC(m)^/2+^+RA’/2+V (10) 
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The first and last term in the right hand side of the above inequality can be seen as forms 
of sample and approximation errors [25] and are studied in Lemma 4 and Theorem 2. The 
mid term can be seen as a computational error and depends on the considered subsampling 
scheme. Indeed, it is shown in Proposition 2 that C(m) can be taken as, 


Cpi(m) = min < t > 0 


(67 V5A/'oo(t)) log < m 


for the plain Nystrom approach, and 


CALs(m) = min 


19k2 , 

-log 

n 


12n 


< t < 


C 


78T^A7(t) log < m|, 


for the approximate leverage scores approach. The bounds in Theorem 1 follow by: 1) 
minimizing in A the sum of the first and third term 2) choosing m so that the computa¬ 
tional error is of the same order of the other terms. Computational resources and regu¬ 
larization are then tailored to the generalization properties of the data at hand. We add a 
few comments. First, note that the error bound in (10) holds for a large class of subsam¬ 
pling schemes, as discussed in Section C.l in the appendix. Then specific error bounds can 
be derived developing computational error estimates. Second, the error bounds in The¬ 
orem 2 and Proposition 2, and hence in Theorem 1, easily generalize to a larger class of 
regularization schemes beyond Tikhonov approaches, namely spectral filtering [30]. For 
space constraints, these extensions are deferred to a longer version of the paper. Third, 
we note that, in practice, optimal data driven parameter choices, e.g. based on hold-out 
estimates [31], can be used to adaptively achieve optimal learning bounds. 

Finally, we observe that a different perspective is derived starting from inequality (10), 
and noting that the role played by m and A can also be exchanged. Letting m play the role 
of a regularization parameter, A can be set as a function of m and m tuned adaptively. For 
example, in the case of a plain Nystrom approach, if we set 


logm 

A =-, 

m 


1 

and m = Sn^v+T'+i logn, 


then the obtained learning solution achieves the error bound in Eq. (9) . As above, the sub¬ 
sampling level can also be chosen by cross-validation. Interestingly, in this case by tuning 
m we naturally control computational resources and regularization. An advantage of this 
latter parameterization is that, as described in the following, the solution corresponding 
to different subsampling levels is easy to update using Cholesky rank-one update formu¬ 
las [34]. As discussed in the next section, in practice, a joint tuning over m and A can be 
done starting from small m and appears to be advantageous both for error and computa¬ 
tional performances. 


4 Incremental updates and experimental analysis 

In this section, we first describe an incremental strategy to efficiently explore different 
subsampling levels and then perform extensive empirical tests aimed in particular at: 1) 
investigating the statistical and computational benefits of considering varying subsampling 
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Input: Dataset Subsampling (xj)pp 

Regularization Parameter A. 

Output: Nystrbm KRLS estimators {ai,..., cCra}- 
Compute Yi; Ri <- ^tT; 
for t G {2,..., m} do 
Compute At,ut,vt; 

/Rt-I oy Rt <-cholup(Rt,Ut/+'); 
^ y 0 Oy’ Rt <—cholup(Rt, Vt/—0; 
at^Ry^Ry(A7y]); 

end for 

Algorithm 1: Incremental Nystrbm KRLS. 



m 


Figure 2: Model selection time 
on the cpuSmall dataset, m G 
[1,1000] and T = 50, 10 repeti¬ 
tions. 


levels, and 2) compare the performance of the algorithm with respect to state of the art 
solutions on several large scale benchmark datasets. Throughout this section, we only 
consider a plain Nystrbm approach, deferring to future work the analysis of leverage scores 
based sampling techniques. Interestingly, we will see that such a basic approach can often 
provide state of the art performances. 

4.1 Efficient incremental updates 

Algorithm 1 efficiently computes solutions corresponding to different subsampling levels, 
by exploiting rank-one Cholesky updates [34]. The proposed procedure allows to effi¬ 
ciently compute a whole regularization path of solutions, and hence perform fast model 
selection^ (see Sect. A). In Algorithm 1, the function cholup is the Cholesky rank-one up¬ 
date formula available in many linear algebra libraries. The total cost of the algorithm is 
0(nm^+m^) time to compute dci,..., am, while a naive non-incremental algorithm would 
require 0(rLm^T + m^T] with T is the number of analyzed subsampling levels. The follow¬ 
ing are some quantities needed by the algorithm: Ai = ai and At = (At_i Ut) G for 

any 2 < t < m. Moreover, for any 1 < t < m, gt = vTTtg and 

Ut = (Ct/(1 + gt), gt), Ut = (K(xt,XiK(xt,Xn)), Ct = a7_iU t + Anbt, 

vt = (ct/(l +gt), -1), bt = (K(xt,xiK(xt,xt-i)), yt = a^Ut + ArLK(xt,Xt). 


4.2 Experimental analysis 

We empirically study the properties of Algorithm 1, considering a Gaussian kernel of width 
CT. The selected datasets are already divided in a training and a test part^. We randomly 
split the training part in a training set and a validation set (80% and 20% of the n training 
points, respectively) for parameter tuning via cross-validation. The m subsampled points 
for Nystrbm approximation are selected uniformly at random from the training set. We 

^The code for Algorithm 1 is available at Icsl.github.io/NystromCoRe. 

^In the following we denote by n the total number of points and by d the number of dimensions. 
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Table 1: Test RMSE comparison for exact and approximated kernel methods. The results 
for KRLS, Batch Nystrom, RF and Fastfood are the ones reported in [6]. Utr is the size of 
the training set. 


Dataset 

Jl-tr 

d 

Incremental 
Nystrom RBF 

KRLS 

RBF 

Batch 

Nystrom RBF 

RF 

RBF 

Fastfood 

RBF 

Fastfood 

FFT 

KRLS 

Mater n 

Fastfood 

Matern 

Insurance Company 

5822 

85 

0.23180 ±4 X 10-= 

0.231 

0.232 

0.266 

0.264 

0.266 

0.234 

0.235 

CPU 

6554 

21 

2.8466 ± 0.0497 

7.271 

6.758 

7.103 

7.366 

4.544 

4.345 

4.211 

CT slices (axial) 

42800 

384 

7.1106 ±0.0772 

NA 

60.683 

49.491 

43.858 

58.425 

NA 

14.868 

Year Prediction MSD 

463715 

90 

0.10470±5x 10-= 

NA 

0.113 

0.123 

0.115 

0.106 

NA 

0.116 

Forest 

522910 

54 

0.9638 ±0.0186 

NA 

0.837 

0.840 

0.840 

0.838 

NA 

0.976 


report the performance of the selected model on the fixed test set, repeating the process 
for several trials. 

Interplay between A and m. We begin with a set of results showing that incrementally 
exploring different subsampling levels can yield very good performance while substan¬ 
tially reducing the computational requirements. We consider the pumadyn32nh (n = 8192, 
d = 32), the breast cancer (n = 569, d = 30), and the cpuSmall (n = 8192, d = 12) 
datasets^. In Figure 1, we report the validation errors associated to a 20 x 20 grid of 
values for A and m. The A values are logarithmically spaced, while the m values are lin¬ 
early spaced. The ranges and kernel bandwidths, chosen according to preliminary tests 
on the data, are ct = 2.66, A G [10~^,1], m G [10,1000] for pumadyn32nh, cr = 0.9, 
A G [lO^^^, 10^^], m G [5,300] for breast cancer, and a = 0.1, A G [10~^^, 10~^^], 
m G [100,5000] for cpuSmall. The main observation that can be derived from this first 
series of tests is that a small m is sufficient to obtain the same results achieved with the 
largest m. For example, for pumadyn32nh it is sufficient to choose m = 62 and A = 10^^ to 
obtain an average test RMSF of 0.33 over 10 trials, which is the same as the one obtained 
using m = 1000 and A = 10^^, with a 3-fold speedup of the joint training and validation 
phase. Also, it is interesting to observe that for given values of A, large values of m can 
decrease the performance. This observation is consistent with the results in Section 3.1, 
showing that m can play the role of a regularization parameter. Similar results are ob¬ 
tained for breast cancer, where for A = 4.28 x 10 ® and m = 300 we obtain a 1.24% 
average classification error on the test set over 20 trials, while for A = 10~^^ and m = 67 
we obtain 1.86%. For cpuSmall, with m = 5000 and A = 10~'^ the average test RMSF 
over 5 trials is 12.2, while for m = 2679 and A = 10^^^ it is only slightly higher, 13.3, but 
computing its associated solution requires less than half of the time and approximately 
half of the memory. 

Regularization path computation. If the subsampling level m is used as a regularization 
parameter, the computation of a regularization path corresponding to different subsam¬ 
pling levels becomes crucial during the model selection phase. A naive approach, that 
consists in recomputing the solutions of Fq. 5 for each subsampling level, would require 
0(m^nT + m^LT) computational time, where T is the number of solutions with different 
subsampling levels to be evaluated and L is the number of Tikhonov regularization param- 

^www.cs.toronto.edu/~delve and archive.ics.uci.edu/ml/datasets 
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eters. On the other hand, by using the incremental Nystrom algorithm the model selection 
time complexity is 0(m^Ti + m^L) for the whole regularization path. We experimentally 
verify this speedup on cpuSmall with 10 repetitions, setting m E [1,5000] and T = 50. The 
model selection times, measured on a server with 12 x 2.10GHz Intel® Xeon® E5-2620 
v2 CPUs and 132 GB of RAM, are reported in Figure 2. The result clearly confirms the 
beneficial effects of incremental Nystrom model selection on the computational time. 
Predictive performance comparison. Finally, we consider the performance of the algo¬ 
rithm on several large scale benchmark datasets considered in [6], see Table 1 . a has been 
chosen on the basis of preliminary data analysis, m and A have been chosen by cross- 
validation, starting from small subsampling values up to mraax = 2048, and considering 
A E [10~^^, 1]. After model selection, we retrain the best model on the entire training set 
and compute the RMSF on the test set. We consider 10 trials, reporting the performance 
mean and standard deviation. The results in Table 1 compare Nystrom computational 
regularization with the following methods (as in [6]): 

• Kernel Regularized Least Squares (KRLS): Not compatible with large datasets. 

• Random Fourier features (RF): As in [4], with a number of random features D = 
2048. 

• Fastfood RBF, FFT and Matern kernel: As in [6], with D = 2048 random features. 

• Batch Nystrom: Nystrom method [3] with uniform sampling and m = 2048. 

The above results show that the proposed incremental Nystrom approach behaves really 
well, matching state of the art predictive performances. 
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A The incremental algorithm 


Let be the dataset and be the selected Nystrdm points. We want to 

compute a of Eq. 5, incrementally in m. Towards this goal we compute an incremental 
Cholesky decomposition Rt for t G {1,..., m} of the matrix Gt = + AnKtt, and the 

coefficients at by at = Note that, for any 1 < t < m — 1, by assuming 

Gt = for an upper triangular matrix Rt, we have 


Gt+i — 



Ct+l\ 

Tt+iy 


/Rt 0\^/Rt 

Vo o) Vo 


+ Ct+i 


with Ct+i 


f 0 Ct+i \ 

W+^ Tt+iy ’ 


and Ct+i, Yt+i as in Section 4.1. Note moreover that Gi = yi- Thus if we decompose the 
matrix Ct+i in the form Ct+i = ut+iuy|_i —vt+ivy|_^ we are able compute Rt+i, the Cholesky 
matrix of Gt+i, by updating a bordered version of Rt with two rank-one Cholesky updates. 
This is exactly Algorithm 1 with Ut+i and Vt+i as in Section 4.1. Note that the rank-one 
Cholesky update requires O(t^) at each call, while the computation of Ct requires O(Tit) 
and the ones of at requires to solve two triangular linear systems, that is 0(t^ -f nt). 
Therefore the total cost for computing aa, ■ ■ ■, ccm is 0(nm^ -f m^). 


B Preliminary definitions 

We begin introducing several operators that will be useful in the following. Let z^,... ,Zm G 
Ti and for all f € a G let 

Zraf = ((zi,f)^,... , (Znt,f)^), 

Z> = L{^iatZt. 

Let Sn = -^^m. and S* = :J^Z^ the operators obtained taking m = n and zt = Kx^, 
Vi = 1,..., n in the above definitions. Moreover, for all f, g G ^ let 

1 

Cn-.n^n, (f, Cng)^ = - y f(Xi)g(Xi). 

n. ^- 

i=l 

The above operators are linear and finite rank. Moreover Cn = S^Sn and Kn = nSnS*, 
and further Bnm = V^SnZ*^ G Gnxm = e and Kn = BnraCLmBXn^ G 

]^nxn_ 


C Representer theorem for Nystrom computational regulariza¬ 
tion and extensions 

In this section we consider explicit representations of the estimator obtained via Nystrom 
computational regularization and extensions. Indeed, we consider a general subspace T-L-m 
of H, and the following problem 

1 

fA,m = argmin - Y" (f(xi)-gt)^-L A||f|||^. (11) 

feWm ^ “T 
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In the following lemmas, we show three different characterizations of 

Lemma 1. Let fA^ra be the solution of the problem in Eq. (11). Then it is characterized by the 
following equation 

(PmCnPm + (12) 

with Pm, the projection operator with range TLm. and 

Proof The proof proceeds in three steps. First, note that, by rewriting Problem (11) with 
the notation introduced in the previous section, we obtain, 

fA,m = argmin ||Snf-Pnf + A||f|||^. (13) 

fe«m 

This problem is strictly convex and coercive, therefore admits a unique solution. Second, 
we show that its solution coincide to the one of the following problem, 

= argmin ||SnPmf-ynf+ A||f|||^. (14) 

few 

Note that the above problem is again strictly convex and coercive. To show that fA^m = 
let = a + b with a E TL-m and b E TL^. A necessary condition for to be optimal, is 
that b = 0, indeed, considering that Pmb = 0, we have 

IlSnPmf* -yn||^ + A||r|||^ = HSnPmd “ Pnll^ + A|| a|||^ + A||b |||^ > ||SnPma “ bn ||^ + A|| a|||^. 

This means that f* E TLm., but on 4(m the functionals defining Problem (13) and Prob¬ 
lem (14) are identical because Pmf = f for any f E 4fm and so fA^m = f*- Therefore, by 
computing the derivative of the functional of Problem (14), we see that fA^m is given by 
Eq. (12). ’ □ 

Using the above results, we can give an equivalent representations of the function fA_m- 
Towards this end, let Zm be a linear operator as in Sect. B such that the range of Zm is 
exactly TLm.. Morever, let 

Zm = UIV* 

be the SVD of Zm where U : I : R* —> R^, V : R^ —> t < m and I = 

diag(ai,..., cTt) with ui > • • • > Ot > 0, U*U = h and V*V = h. Then the orthogonal 
projection operator Pm is given by Pm = W* and the range of Pm is exactly TLm.- In the 
following lemma we give a characterization of fA,m that will be useful in the proof of the 
main theorem. 

Lemma 2. Given the above definitions , fA^m can be written as 

^A,m = V(V*CnV + Al)-V*s;yn. (15) 

Proof By Lemma 1, we know that fA^m is written as in Eq. (12). Now, note that fA^m = 
PmfA,m and Eq. (12) imply (PmCmPm + AI)PmfA,m = PmSnbn, that is equivalent to 

V(V*CnV +AI]V*fA,m = W*s;yn, 


15 


by substituting Pra with W*. Thus by premultiplying the previous equation by V* and 
dividing by V*CTaV + Al, we have 


= (V*C^V + Al)-V*s;yn 


Finally, by premultiplying by V, 

fA,m = Pmfym = V(V*Ca^V + Al)” 


□ 

Finally the following result provide a characterization of the solution useful for com¬ 
putations. 

Lemma 3 (Representer theorem for fA,m)- Given the above definitions, we have that 
can be written as 

m 

= Y_ with a = + AnGTaiu)^B^,^y V x G X. (16) 

i=i 

Proof. According to the definitions of Bnm and Gram we have that 

a = (B;[,^Bnm + AnGarm)^B;[,^y = ((ZrrrS;)(SnZ;,) +A(ZarZ;,))t(ZarS;)yn. 
Moreover, according to the definition of Zra we have 

m 

fA,m(^) ~ y (^i) l^x) — (-Zm^X) 5t)Rm = (K^, Zra&)'H V X G X, 

1=1 


SO that 

h,m = ZUUmS*n){SnZ*J + MZmZ*J)HZmS*JVn = ZUZrnCnxZ*jHZmS*n)yn, 

where CnA = Cn + AL Let F = UI, G = V*CnV + Al, H = lU^, and note that F, GH, 
G and H are full-rank matrices, then we can perform the full-rank factorization of the 
pseudo-inverse (see Eq.24, Thm. 5, Chap. 1 of [1]) obtaining 

(ZaiCnAZ;,)^ = (FGH)t = Ht(FG)^ = HtG-’Ft = UI-HV*CnV +AI)-’l-^UL 

Finally, simplyfing U and I, we have 

= ZUZmCn?.Z*jHZ^S*Jyn 
= VIU*UI-^ (V*CnV + Al)-^l-’u*uiv*s;yn 
= V(V*CnV + Al)-V*S;yn. 


□ 
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C.l Extensions 


Inspection of the proof shows that our analysis extends beyond the class of subsampling 
schemes in Theorem 1. Indeed, the error decomposition Theorem 2 directly applies to a 
large family of approximation schemes. Several further examples are described next. 

KRLS and Generalized Nystrom In general we could choose an arbitrary ^ T-L. Let 
Ztti : T-L —> be a linear operator such that 

^n^ = ranZ;, = {f If = Z>, (17) 

Without loss of generality, Z^ is expressible as Z^ = (zi,..., Zra)^ with zi,..., Zttl G "H, 
therefore, according to Section B and to Lemma 3, the solution of KRLS approximated 
with the generalized Nystrom scheme is 

m 

^ aiZi(x), with a= + (18) 

i=l 

with Bnm G (Bnm)ij = Zj(xj and Gram G (GTnm]tj = (zi,Zj)^, or equiva¬ 

lently 


m 

fA,m(^) = &iZl(x), a = + ArLl)^yn.) = Bn.TaGTTLmBnm (19) 

i=l 

The following are some examples of Generalized Nystrom approximations. 

Plain Nystrom with various sampling schemes [2-4] For a realization s : N — > {1 ,..., n} 
of a given sampling scheme, we choose Zra = with S^ = where 

is the training set. With such Z^ we obtain Kn = ^tid so Eq. (18) 

becomes exactly Eq. (5). 

Reduced rank Plain Nystrom [5] Let p > m, Sp as in the previous example, the linear 
operator associated to p points of the dataset. Let Kpp = SpSp G that is (Kpp]q = 

K(xi,Xj). Let Kpp = OiUiuT its eigenvalue decomposition and Lira = (ui,... ,Uai,). 
Let (Kpp)ra = U^KppUai be the m-rank approximation of Kpp. We approximate this fam¬ 
ily by choosing = Li^Sp, indeed we obtain Kn = Knmhlm(i-l^KppUna)iU^KX,,^ = 
knm (i^pp) m ^nm ■ 

Nystrom with sketching matrices [6] We cover this family by choosing Zm = RmSn, 
where Sn is the same operator as in the plain Nystrom case where we select all the 
points of the training set and Rm a m x n sketching matrix. In this way we have Kn = 
KnR|^(RTnRnRm)^RmRn> that is exactly the SPSD sketching model. 
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D Probabilistic inequalities 


In this section we collect five main probabilistic inequalities needed in the proof of the 
main result. We let px denote the marginal distribution of p on X and p(-|x) the conditional 
distribution on M given x G X. Lemmas 6, 7 and especially Proposition 1 are new and of 
interest in their own right. 

The first result is essentially taken from [7]. 


Lemma 4 (Sample Error). Under Assumptions 1, 2 and 3, for any & > 0, the following 
holds with probability 1 — 5 


(C + 




log- 


Proof The proof is given in [7] for bounded kernels and the slightly stronger condition 

J — 1 )dp(y |x) < a^/M^ in place of Assumption 2. More precisely, note 

that 

1 

(C + - Cnfw) = - Z 

^ i=l 

where Ci, • • •, Cn ate i.i.d. random variables, defined as Ci = (C + — f?^(xi)). 

For any 1 < i < n. 




(C + AD-i/^KxJyi 

XxR 


(C + AI)-i/^Kx^ 

X 



f^(xi))dp(xi,yi) 


fH(xi)]dp(yi|xi)dpx(xi) 


0 , 


almost everywhere by Assumption 1 (see Step 3.2 of Thm. 4 in [7]). In the same way we 
have 


E||Ct||P = ||(C + Air^/%(yi-f^(xt))||Mp(xi,yO 


Xxl 


[C + AI] |yi-f^(xi)Pdp(yi|xi)dpx(xi) 


<sup||(c+Air^/2Kxir^ ii(c+Air^/%j 

xex Jx 

< lp!^aW(A)^MxAUA)F-^ 


lyi-fw(xi]pdp(yi|xOdpx(xi) 


where sup,^gxlKC + AI]“'/^Kx|| = -\/A(oo(A) and Jxll(C + AI)“^/^KxJ|^ = A((A) by Assump¬ 
tion 3, while the bound on the moments of y — f (x) is given in Assumption 2. Finally, to 
concentrate the sum of random vectors, we apply Prop. 11. □ 


The next result is taken from [8]. 

Lemma 5. Under Assumption 3, for any 8 > 0 and ^ log T < A < II C||, the following 
inequality holds with probability at least 1 — 5, 

||(Cn+AI]-’/^C^/2|| < ||(Cn + AI)-^/2((^^^j^l/2|| <2. 
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Proof. Lemma 7 of [8] gives an the extended version of the above result. Our bound on A 
is scaled by because in [8] it is assumed k < 1. □ 


Lemma 6 (plain Nystrom approximation). Under Assumption 3, let ] be a partition of 
{1,..., n} chosen uniformly at random from the partitions of cardinality m. Let A > 0, for 
any 8 > 0, such that m > 67 log ^ V 5A/’oo(A) log the following holds with probability 

1 -5 

||(I-P^)(C + A1)1/2||"<3A, 

where Pra is the projection operator on the subspace ra = span{Kxj | j G J}. 

Proof. Define the linear operator Cra '.Li ^ Li, as Cra = Hjej ® . Now note that 

the range of Cra is exactly Lim.- Therefore, by applying Prop. 3 and 7, we have that 

11(1 - < AlKCrrr + AD-’/^C^/^f < 

with (3(A) = Amax — Cra)C)C''^^^. To uppetbound we need an upperbound 

for (3(A). Considering that, given the partition J, the random variables Cj = ® Kx^ are 

i.i.d., then we can apply Prop. 8, to obtain 


2w 

(3(A) < — + 
3m 


2wA/'oo(A) 


m 


where w = log with probability 1 — 8. Thus, by choosing m > 67w V 5A/’oo(A)w, we 
have that (3(A) <2/3, that is 

||(I-Pm)Ci/"f <3A. 

Finally, note that by definition Tr(C) < K^. □ 

Lemma 7 (Nystrom approximation for ALS selection method). Let (ti(t)){L, be the collec¬ 
tion of approximate leverage scores. Let A > 0 and P^ be defined as P^ (i) = li(A) / Ij (A) 
for any i G N with N = {1, ..., n}. Let J = (ii,..., ira) be a collection of indices indepen¬ 
dently sampled with replacement from N according to the probability distribution P^. Let Pra 
be the projection operator on the subspace Li-m = span{Kxj |j G J} and J be the subcollection 
of 3 with all the duplicates removed. Under Assumption 3, for any 8 > 0 the following holds 
with probability 1 — 28 

||(l-Par)(C + AI)^/2|| <3^^ 

when the following conditions are satisfied: 

1. there exists a T > 1 and a Aq > 0 such that (ti(t))]Ti are J-approximate leverage scores 
for any t > Aq (see Def 1 ), 

2. n > 1655k^ + 223k^ log 

3. AoV^^log^ <A< ||C||, 

4. m > 334 log ^ V 78T2A((A) log 
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Proof. Define t = 6/4. Next, define the diagonal matrix H G with (H]u = 0 when 
PA(i) = 0 and (H)u = when PA(i) > 0, where q(i) is the number of times the index 

i is present in the collection J. We have that 


s;HSn = 


m 


L 

i=l 


-^K 

PA(i) 




q(i) 


L 


K. 


K. 


Now, considering that > 0 for any j G J, thus ran S* HSn = T-i-m- Therefore, by using 
Prop. 3 and 7, we exploit the fact that the range of P^^i is the same of S* HSn, to obtain 

11(1 - P^)(C + A1)1/2||" < A||(S;HSn + Air'/2(C + A1)1/2||2 < 

with p(A] = Amax . Considering that the function (1 — x)“^ is 

increasing on —oo < x < 1, in order to bound A/(l — |3(A)] we need an upperbound for 
|3(A). Here we split |3(A) in the following way, 

P(A] < An,ax (c/^/^(C - Cn)C;’/^) + An,ax - S;HSn]C/^/^) . 

'-V-' '-V-" 

|3i(A) P2(A) 


Considering that Cn is the linear combination of independent random vectors, for the first 
term we can apply Prop. 8, obtaining a bound of the form 


2w 

(3i(A)< —+ 
3n 



> 


with probability 1 — t, where w = log^ (we used the fact that Afoo(A) < k^/A). Then, 

1 /2 

after dividing and multiplying by C^/^, we split the second term pi (A) as follows: 


P2(A) < 
< 
< 


ic: 


('/^(Cn-s;HSn)C;,^/^l 


|p-l/2pl/2p 

I '“A '“nA '^nA 


1/2, 


IC ' c 


tlA 


Cn-s;Hs , 
(Cn-S;HSn 


p—l/2pl/2^ 

'^nA '^nA '“A 


1 / 2 | 


l/2^1/2"2"C;;;/"(Cn - SCHSnlC:^/"' 


"tlA 


Let 

P3(a) = i|c;^'"(Cn-s;Hs„)c;^'^ll = ||c;;''"s;(i-h)s„c;,;'"||. 

Note that SnC“;,[S* = Kn(Kn + Anl)"^ indeed Cfl = (S*Sn + Al)“' and 
Therefore we have 


( 20 ) 

nSnS;. 


SnCflS*^ = Sn(S;Sn + AI]”’ = (SnS^ + AI]-’ SnS^ = (K^ + Kn. 

Thus, if we let ULU^ be the eigendecomposition of K^, we have that (Kn + ArLl)“^Kn = 
U(I + ArLl)“^LU^ and thus SnC^p[S(^ = U(I + ArLl)~^IU^. In particular this implies that 

SnC/^S* = with Qn = (I + Anl)"^ L. Therefore we have 

p3(A) = iic^y"s;(i-H)SnC^yyi = nQy^u’^d-HjuQyyi, 
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where we used twice the fact that ||ABA*|| = ||(A*A]'/^B(A*A)'/^|| for any bounded linear 
operators A, B. 

Consider the matrix A = and let at be the i-th column of A, and be the r-th 

canonical basis vector for each I G N. We prove that ||ai||^ = li(A), the true leverage score, 
since 

llmf = IIQy^U^etf = e^^UQ^U^et = ((K^ + Anlj-^ K^k = li(A). 

Noting that 


(33(A) = ||AA 


T 


1 

m 


L 

iea 


PA(i: 


■at a 




Moreover, by the T-approximation property of the approximate leverage scores (see Def. 1), 
we have that for all i G {1,..., n}, when A > Aq, the following holds with probability 1 — 5 


PA(i) = 


li(A) 

Ljtj(A) 


> T- 


' li(A) 

Lkj(A) 


= r 


Ui 


TrAAT’ 


Then, we can apply Prop. 9, so that, after a union bound, we obtain the following inequal¬ 
ity with probability 1 — 6 — t: 


2||A||kog^ 
133(A) < II II ^ ^ + 

3m 


/2||A|pkTr AA^" log ^ 2 log 


< 


m 


3m 


2n 

^ + 


'2T2Xr(A)log 


in 

T 


m 


where the last step follows from II A|p = ||(Kn + ArLl) 'Kn|| < 1 andTr(AA^) =Tr(CyCn) 
X/’(A). Applying Proposition 1, we have that X)’(A) < 1.3Af (A) with probability 1 — t, when 
log ^ < A < ||C|| and n > 405k V 67k log k. Thus, by taking a union bound again. 


19k^ 


we have 


133(A) < 


21og^ /5.3kA7(A)logy5^ 


3m 


+ 


m 


with probability 1 — 2 t — 5 when Aq V log f < A < ||C|| and n > 405k V 67k log 
The last step is to bound \\C^^\ \^, as follows 

= iicy/"CnAcy/yi = iii+cy/ycn-c)cy/y < i + q, 

with q = ||Cy'^^(Cn — C)Cy^^||. Note that, by applying Prop. 8 we have that q < 


— 3 Ari-^ V 1^1 ^1^^ probability 1 — t and 0 = log Finally, by collecting the above 

results and taking a union bound we have 


2w 

P(A)<^ + 

3n 


2wk^ 

An 


+ (1 +b) 


2 log 


3m 


in 

^ + 


/5.3TW(A)log 


in 

T 


TTL 


with probability 1 — 4 t — 5 = 1 — 25 when Aq V log ^ < A < ||C|| and n > 405k V 
67k log Note that, if we select n > 405k V 223k log m > 334log Aq V 
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< A < lieII and log ^ ^ -j conditions are satisfied and we have 

(3(A) < 2/3, so that 

||(I-Pn^)(C + AI)l/2f <3A, 

with probability 1—26. □ 

Proposition 1 (Empirical Effective Dimension). LetM[\) =TrCnC~;,[. Under the Assump¬ 
tion 3, for any 6 > 0 and n > 405 k^ V 67K^log^, if -^^log^ < A < ||C||, then the 
following holds with probability 1 — 6, 


|Ar(A)-Af(A)| 

Af(A) 


< 4.5q + (1 + 9q) 



+ 


q + 13.5q^ 
Af(A) 


< 1.65, 


. , 4k^ log f 

Proof Let t = 6/3. Define . Choosing A in the range log ^ < 

A < ||C||, Prop. 8 assures that Aniax(13n) < 1 /3 with probability 1 — t. Then, using the fact 
that Cfl = — Bn)“^ (see the proof of Prop. 7) we have 

\/f[\) -A7(A)| = |TrC;;|Cn - = ATr C;;[(Cn - | 

= lATr (I - Bn)-^ C/^/"(Cn - C)C-’/^C-^/^| 

= |ATrC-'/^(I-Bn)-’ 

Considering that for any symmetric linear operator X : the following identity holds 

(I - X)“^X = X + X(I - X)^^X, 


when Aniax(X) < 1, we have 

A| Tr C-^/^ (1 - Bn)-^ BnC-’/^l < A| Tr C/^/^BnC-’/^l 

'-V-' 

A 

+ A|TrC;/^/X(I-Bn)-^ BnC;/^/^|. 

'-V-' 

B 

To find an upperbound for A define the i.i.d. random variables pi = (Kx^, AC/^KxJ G 
M with i G {1,..., n}. By linearity of the trace and the expectation, we have M = Epi = 
E(Kx„AC/^KxJ =ETr(AC/^Kx, ®KxJ =ATr(C/^C). Therefore, 


A|TrC/'/^BnC-^/^| 



and we can apply the Bernstein inequality (Prop. 10) with 

|M-pi| < A||C/2 ||||Kx, f + M < ^ + M < ^ = L, 
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]E(tii - M)^ = Erif - < Erjf < LM = cr^. 

An upperbound for M is M = Tr(AC^^C) = Tr((I — C)C^' C] < AA(A). Thus, we have 


A|TrC-^/^BnC;^/^| < 


4k^ log I / 4 kW(A) loT^ 
3 Ati V An 


with probability 1 — t. 

To find an upperbound for B, let £ be the space of Hilbert-Schmidt operators onH. C 
is a Hilbert space with scalar product (U, V ),_(5 = Tr(UV*) for all U, V G £. Next, note 

that B = IIQIIhs where Q = (I — moreover 

IIQII^HS < ||Al/'C-'/2f||BnfHsllU-Bn]-'/'f < l-SllBnll^HS, 

since ||(1 — = (1 — Aniax(Bn))~' < 3/2 and (1 — is increasing and positive 

on [— oo, 1). 

To find a bound for llBnllus consider that B^ = T— £ where £ are i.i.d. random 

operators defined as Ct = (8) G £ for all i G {1,...,n}, and T = ECi = 

C/' C G £. Then we can apply the Bernstein’s inequality for random vectors on a Hilbert 
space (Prop. 11), with the following L and 

||T - Cl IIhs < ||Kx, fu + IITIIhs < y + I|T||hs < ^ = L, 

E||Ci -Tf = ETr(Cf-T2) < ETr(C?) < LETr(Ci) = 
where ||T||hs < ETr(Ci) = AC(A), obtaining 


IB 


nllHS < 


4k^ log; 
An 


+ 


I 4k^AC(A) log 


An 


with probability 1 — t. Then, by taking a union bound for the three events we have 

2 

iXriA) -AC(A)| < q + \/3qAC(A) + 1.5 (3q + V3qAC(A)) , 

4k^ log — 

with q = —with probability 1—5. Finally, if the second assumption on A holds, 
then we have q < 4/57. Noting that n > 405k^, and that AC(A] > ||CC/^ || = ^ 1 

we have that 


□ 
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E Proofs of main theorem 


A key step to derive the proof of Theorem 1 is the error decomposition given by the 
following theorem, together with the probabilistic inequalities in the previous section. 

Theorem 2 (Error decomposition for KRLS+Ny). Under Assumptions 1, 3, 4, let v = 
min(s, 1 /2] and a KRLS + generalized Nystrom solution as in Eq. (18). Then for any 
A, m > 0 the error is bounded by 

5(A,n) + ), (21) 

Sample error Computational error Approximation error 

where 5(A,n) = ||(C + AI)-’/2(S;yn - andC(m) = ||(I - P,n)(C + with 

Moreover q = R((32 V (1 + 0|3)], |3 = |KCn + Air’/2(C + AI]V2||^ 

0 = ||(Cn + AI)l/2(C + AI)-l/2||. 

Proof. Let Ca = C + AI and CnA = Cn + AI for any A > 0. Let fA^iri as in Eq. (18). 
By Lemma 1, Lemma 2 and Lemma 3 we know that fA^ra is characterized by Ia^tr = 
gAm(Cn]Sj;pn with gA,ra(Cn) = V(V*CnV +AI)”! V*. By using the fact that £:(f) - £:(fH) = 
||CV2^f _ f^)|||^ for any f gTL (see Prop. 1 Point 3 of [7]), we have 

= \\C^^Htym-fn)\\n = l|C'/'(gA,m(Cn)S;yn-f«)||« 

= l|C''^^(gA,m(Cn)S* (pn — Snf^ + Snf^) — f?t)||w 
< ||C'/2gA,m(Cn)S;(yn " Snf«)b + - gA,m(Cn) Cn)f« ||h ■ 

^-V-' ^-V-' 

A B 


1 n 

Bound for the term A Multiplying and dividing by 

A < l|cv^c-/^||||c:/^gA,^(cjc:/^||||c;^^^ 

where the last step is due to Lemma 8 and the fact that 

l|cV2c;;/"ll < ||cV2c-^A||||cf 

Bound for the term B Noting that gA,m(CTL)CnAVV* = 


and we have 
S;(yn-Snf«)||w < l5^S(A,n), 

. Ilpl/2p-l/2|| 

— II'“A '“nA 11- 

= W*, we have 


i gA,m(Cn]Cn — I gA,Ta(Cn)CnA “f ^gA,Ta(Cn] 

= I - gA,m(Cn)CnAW* - gA,m(Cn) CnA (I - W*) + AgA,m(Cn) 

= (I-W*)+ AgA,nx(Cn) - gA,m(Cn)CnA(I - W*). 

Therefore, noting that by Ass. 4 we have < ||Cp^®f^||^ < ||C“®f^||^ < R, then, 

by reasoning as in A, we have 

B < ||CC2(I-g,^,,^(Cn)Cn)C);||||C;^''f^||« 

<R||CC2c-'A||||^ii/2^j_^,^^v|| + R^||CV2c-iA||||cC2g^^^^^^^^v^ 

+ R||cC2c;;/^llllc;/'g,,,,,(Cn]cy,'||||cy,'c-^/'|||^ 

< R(1 + |30) ||c;/"(l - W*)q|| +R(3 A||Cy,'gA,^(Cn)C);||, 

'-V-' '-V-' 

B.l B.2 
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where in the second step we applied the decomposition of I — gAm(Cn)Cn- 

Bound for the term B.l Since W* is a projection operator, we have that (I — W*) = 
(I — W*)®, for any s > 0, therefore 

B.l = ||cy^(i-w*]^q|| < ||c}/^(i-w*)||||(i-w*)ci(||. 

By applying Cordes inequality (Prop. 4) to ||(I — W*)Cj[|| we have, 

11(1-w*]q|| = 11(1-= ||(i-w*]cy^f''. 

Bound for the term B.2 We have 

B.2<A||c;/Vm(Cn]c(;,||||cjcj;ii 
<A||c;/'gA,m(Cn]c(;,||||c;;/'cf f'' 

< |3""A|| (V* CnAV)^(V* CnAV)-l (V* CnAV]"|| 

= (3^''A||(V*CnV + Al)-(^/2^''>|| < |3A’/^+\ 

where the first step is obtained multipling and dividing by the second step by apply¬ 
ing Cordes inequality (see Prop. 4), the third step by Prop. 6. □ 

Proposition 2 (Bounds for plain and ALS Nystrdm). For any 8 > 0, let n > 1655k^ + 
223k^ log let log ^ < A < ||C|| and define 

12k^ 1 

(67 V5Afoo(t)] log< ttl| , 

. f 1 9 k^ 12rL 

CALs(m) = min<^-log^ < t < C 

( TL O 

Under the assumptions ofThm. 2 and Assumption 2, 3, if one of the following two conditions 
hold 

1. plain Nystrdm is used, 

2. ALS Nystrdm is used with 

(a) L-approximate leverage scores, for any t > log ^ (see Def 1 ), 

(b) resampling probabilities Pt where t = CALs(Tn] (see Sect. 2), 

(c) m > 334 log 

then the following holds with probability 1 — 5 

|^(fA,m] < 6R log^ + 3RC(m)’/2+^ + 3RA'/2+'' 

( 22 ) 

where C[vrC] = Cpi(m) in case of plain Nystrdm andC[m) = Cals(ttl) in case of ALS Nystrdm. 


78T^A7(t) log < ml . 


Cpl(Tn) = min < t > 0 
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Proof. In order to get explicit bounds from Thm. 2, we have to control four quantities 
that are (3,0,5(A, n) and C(m). In the following we bound such quantities in probability 
and then take a union bound. Let t = 6/3. We can control both |3 and 0, by bounding 
b(A) = Indeed, by Prop. 7, we have that |3 < 1/(1 —b(A)), while 

02 = = III + C/’/"(Cn - C)C/^/^|| < 1 + b(A). 

Exploiting Prop. 8, with the fact that Af(A) < AZ/olA) < ^ Tr C < k^, we have that 

b (A) < for w = log ^ with probability 1 — t. Simple computations show 

that with n and A as in the statement of this corollary, we have b(A) < 1/3. Therefore 
(3 < 1.5, while 0 < 1.16 and q = R(p^ V (1 + 0|3)] < 2.75R with probability 1 — t. Next, 
we bound 5(A, n). Here we exploit Lemma 4 which gives, with probability 1 — t, 

5(A.n)<2('^» + y^)lcgL 

To bound C(m) for plain Nystrdm, Lemma 6 gives C(m) < 3t with probability 1 — t, for 
a t > 0 such that (67 V 5A/’oo('t)) log ^ < m. In particular, we choose t = Cpi(m) to 
satisfy the condition. Next we bound C(m) for ALS Nystrdm. Using Lemma 7 with Aq = 
1^ log we have C(m) < 3t with probability 1 — t under some conditions on t, m, n, 
on the approximate leverage scores and on the resampling probability. Here again the 
requirement on n is satisfied by the hypotesis on n of this proposition, while the condition 
on the approximate leverage scores and on the resampling probabilities are satisfied by 
conditions (a), (b) of this proposition. The remaining two conditions are 1^ log ^ < t < 
IICII and (334V78T2A/(t))loglfi < m. They are satisfied by choosing t = Cals(itl) and 
by assuming that m > 334 log 1^. Finally, the proposition is obtained by substituting each 
of the four quantities (3,0,5(A,n),C(m) with the corresponding upperbounds in Eq. (21), 
and by taking the union bounds on the associated events. □ 

Proof of Theorem 1. By exploiting the results of Prop. 2, obtained from the error decom¬ 
position of Thm. 2 we have that 

|£(fx,™) < SR log| + 3RC(m)'/»» + 3 raV^+v 

(23) 

with probability 1 — 6, under conditions on A, m, n, on the resampling probabilities and 
on the approximate leverage scores. The last is satisfied by condition (a) in this theorem. 
The conditions on A,n are n > 1655 k 2 +223K2log^ and < A < ||C||. If we 

assume that n > 1655 k 2 + 223 log ^ log ^ y q ^ we satisfy the condition on n 

and at the same time we are sure that A = ||C||rL“^/t2v+Y+i) satisfies the condition on A. In 
the plain Nystrdm case, if we assume that m > 67 log 1^ + 5A7oo (A) log then C (m) = 

Cpl(Tu) < A. In the ALS Nystrdm case, if we assume that m > (334 V78T2A7(A))log^ 
the condition on m is satisfied, then C(m) = Cals(ttl) < A, moreover the conditions on the 
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resampling probabilities is satisfied by condition (b) of this theorem. Therefore, by setting 
A= ||C||n“h(2v+Y+i) in Eq. (23) and considering that A(oo(A) < k^A“^ we easily obtain the 
result of this theorem. □ 

The following lemma is a technical result needed in the error decomposition (Thm. 2) . 

Lemma 8. For any A > 0, let V be such that V*V = I and Cn be a positive self-adjoint 
operator. Then, the following holds, 

||(Cn + AI)’/^V(V*CnV + Al)-V*(Cn + AI)’/^|| < 1. 

Proof Let C^a = Cn + AI and gAm(Cn) = V(V*CnV + Al)“^ V*, then 

l|Cy,'gAm(Cn)C;,/"f = ||C;^/"gA^(Cn)CnAgAm(Cn)C;,/'|| 

= II Cnl" V( V* CnAV)-l (V* CnAV) (V* CnAV)"! V* €^,^11 

= II ^nA 9ATa(Cn)C,^'^ ||, 

1/2 1/2 

and therefore the only possible values for ||C/;^ gAm(Cn)C^;;^ || are 0 or 1. □ 


F Auxiliary results 

Proposition 3. Let TL,IC,T three separable Hilbert spaces, let Z : H ^ 1C be a bounded 
linear operator and let P be a projection operator on TL such that ranP = ranZ*. Then for 
any bounded linear operator P : H ^ TL and any A > 0 we have 

||(I-P)X|| < A^/^||(Z*Z + Air^/2X||. 

Proof First of all note that A(Z*Z + AI)“^ = I — Z*[ZZ* + A1)^'Z, that Z = ZP and that 
||Z*(ZZ* + AI)“^Z|| < 1 for any A > 0. Then for any v GTLwe have 

^v,Z*(ZZ* + AI)-’Zv^ = ^v,PZ*(ZZ* +Al)-'ZPv^ = ||(ZZ* + Air^/^ZPvf 

< ||(ZZ* +Al)-’/^Zf ||Pvf < ||Pvf = (v,Pv) 

therefore P — Z*(ZZ* + AI]“'Z is a positive operator, and (I — Z*[ZZ* + AI]“'Z] — (1 — P] 
too. Now we can apply Prop. 5. □ 

Proposition 4 (Cordes Inequality [9]). Let A, B two positive semidefinite bounded linear 
operators on a separable Hilbert space. Then 

||A®B®|| < ||AB||® when0<s<1. 

Proposition 5. Let TL,lC,iF,G be three separable Hilbert spaces and let X : TL ^ K, and 
Y :TL ^ LF be two bounded linear operators. For any bounded linear operator Z : G ^ TL, if 
Y*Y — X*X is a positive self-adjoint operator then ||XZ II < l|YZ||. 
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Proof. If Y*Y — X*X is a positive operator then Z*(Y*Y — X*X)Z is positive too. Thus for 
all f € ^ we have that (f, (Q - P)f) > 0, where Q = Z*Y*YZ and P = Z*X*XZ. Thus, by 
linearity of the inner product, we have 

IIQII =sup(f,Qf) =sup{(f,Pf) + (f, (Q-P)f)}>sup(f,Pf) = ||P||. 
fee fee fee 


□ 


Proposition 6. Let TL,1C be two separable Hilbert spaces, let A ■. Li ^ Li be a positive 
linear operator, V : Li ^ )C a partial isometry and B : /C —> /C a bounded operator. Then 
||A’'VBV*A®|| < \\[V*AVYB[V*AVY\\, for alio <r,s < 1/2. 

Proof. By Hansen’s inequality (see [10]) we know that (V*AV)^^ — V*A^^V is positive 
selfadjoint operator for any 0 < t < 1/2, therefore we can apply Prop. 5 two times, 
obtaining 

||A’'V(BV*A®]|| < ||(V*AV]’'(BV*A®)|| = ||((V*AV)’'B)V*A®|| < ||((V*AV]’'B)(V*AV]®||. 

□ 


Proposition 7. Let Li be a separable Hilbert space, let A, B two bounded self-adjoint positive 
linear operators and A > 0. Then 

||(A + AI)-^/2b’/^|| < ||(A + AI)-^/^(B + AI)’/^|| < (1 - 

when 


(B + AI)-^/2(B - A)(B +AI)-'/^ 


< 1 . 


P — ^max 

Proof. Let B;^ = B + AI. First of all we have, 

||(A + AI)^’/^B^/^|| = ||(A + 

< ||(A + AI)-^/^Bi/^||||B“'/^B^/^|| < ||(A + AI)-^/^By^||, 

= vp^ ^ 


since 


< 1. Note that 


(A + AI)-1 = [(B + AI) - (B - A)]^^ 


B 


1/2 


= B 


A 

- 1/2 


I-bJ/^(B-A)B- 


-V2^ 3 1/2 


I_B-1/2(B-A)B-'/"' 


1-1 


,- 1/2 


Now let X = (I — — A)B^ ^. We have that. 


||(A + AI)-^/^bJ''^|| = ||B^/^(A + AI)-^bJ/^||^/^ 

because ||Z|| = ||Z*Z||'/^ for any bounded operator Z. Finally let Y = — A)B)/''^^ 

and assume that Aniax(Y) < 1, then 

||X|| = ||(I-Y)-l|| = (1-An,ax(Y))-\ 

since X = w(Y) with w(ct) = (1 — for — oo < a < 1, and w is positive and monotoni- 
cally increasing on the domain. □ 
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G Tail bounds 


Let IHIhs denote the Hilbert-Schmidt norm. 

Proposition 8. Let vi ,..., Vn with n > 1, be independent and identically distributed random 
vectors on a separable Hilbert spaces Li such that Q = E v (g) v exists, is trace class, and for 
any A > 0 there exists a constant A/’oo(A) < oo such that (v, (Q + AI)~^v) < A^oo(A) almost 
everywhere. Let Qn = '^1 <S) tttid take 0 < A < ||Q||. Then for any 6 > 0, the 

following holds 

IKQ + - Q„)(Q + AI)-'/^|| < 

with probability ^ — 28. Here (3 = log Moreover it holds that 

Amax ((Q + AI]-1/2(Q - Qn)(Q + AI)-1/^) < ^ + 
with probability 1 — 8. 

Proof. Let Qa = Q + AI. Here we apply Prop. 12 on the random variables Zi = M — 
^Vi (g) with M = for 1 < i < n. Note that the expectation of Zt 

is 0. The random vectors are bounded by 

< (v,q;^^) + ||q-'/'qq;’/'|| <Afoo(A] + i 

and the second orded moment is 

E(Zi)2=E (vi,Q;^1vi) 

< Afoo(A]EQ;’/^i (g = Afoo(A]Q = S. 

Now we can apply Prop. 12. Now some considerations on (3. It is (3 = log p|| = ^q-^q||^ j 
nowTrQ^'Q < need a lowerbound for ||Q^'Q|| = where Oi = ||Q|| is 

the biggest eigenvalue of Q, now A < cti thus . 

For the second bound of this proposition, the analysis remains the same except for L, 
indeed 

sup(f,Zif) = sup(f,Q;^^Qf) - < sup(f,Q^^Qf\ < 1. 

few few ^ ^ ' few ^ ' 

□ 

Remark 1. InProp. 8, letdefine = infA>oAfoolAjdlQU+A). Whenn > 405k^V67k^ log 
and ^ log ^ < A < IIQII we have that 

Amax((Q + Air'/2(Q-Qn)(Q+AI]-’/2) < 1 

with probability 1 — 8, while it is less than 1 /3 with the same probability, if log 

IIQII- 
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Proposition 9 (Theorem 2 [11]. Approximation of matrix products.). Let n, n be positive 
integers. Consider a matrix A G and denote by the i-th column of A. Let m < n and 

I = {li,..., Ira} be a subset o/ N n} formed by m elements chosen randomly with 

replacement, according to a distribution that associates the probability P(l) to the element 
1 G N. Assume that there exists a (3 G (0,1] such that the probabilities P(1),..., P(n) satisfy 
Ha- IP 

P(i) > for all I G N. For any 5 > 0 the following holds 


AA'^ 


-Y — 

P(i' 

iei 


aiU. 


T| 


< 


^Llogx ^ / 2LS log^ 
3m V 


with probability 1 — 6. Here L = ||A|p and S = -^Tr AA^. 

Proposition 10 (Bernstein’s inequality for sum of random variables). Let xi,... ,Xn be a 
sequence of independent and identically distributed random variables on M with zero mean. 
If there exists an L,S G M such that xi < L almost everywhere and Exf < S, then for any 
6 > 0 the following holds with probability 1 — 6; 


1 

n 


n 

< 

i=l 


log I / ^Slogl 
3n Y n 


If there exists an L' > |xi | almost everywhere, then the same bound, computed with L' instead 
of L, holds for the for the absolute value of the left hand side, with probability 1 — 26. 

Proof. It is a restatement of Theorem 3 of [12]. □ 


Proposition 11 (Bernstein’s inequality for sum of random vectors). Let zi,..., Zn be a 
sequence of independent identically distributed random vectors on a separable Hilbert space 
'LL. Assume q = Ezi exists and let o, M > 0 such that 

E||zi - \i\W < 

for all p > 2. Then for any t > 0.* 


1 

-Yzi-pWn < 

n r~~: 


21-log f ^ / 2gMog| 


n 


n 


with probability greater or equal 1 — 5. 

Proof, restatement of Theorem 3.3.4 of [13]. □ 


Proposition 12 (Bernstein’s inequality for sum of random operators). Let 'LL be a separable 
Hilbert space and Zef Xi,..., be a sequence of independent and identically distributed self¬ 
adjointpositive random operators on 'LL. Assume that there exists EXi = 0 and Amaxl^i) < L 
almost everywhere for some L > 0. Let S be a positive operator such that E(Xi < S. Then 
for any 5 > 0 the following holds 


A 


max 



^ 2L|3 ^ /2pl3 

“ 3n V n 
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with probability 1 — 6. Here (3 = log pp• 

If there exists an L' such that L' > ||Xi || almost everywhere, then the same bound, com¬ 
puted with L' instead ofL, holds for the operatorial norm with probability 1 — 26. 

Proof The theorem is a restatement of Theorem 7.3.1 of [14] generalized to the separable 
Hilbert space case by means of the technique in Section 4 of [15]. □ 
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